A system for voice conversion based on adaptive filtering and line spectral frequency distance optimization for text-to-speech synthesis
Abstract
This paper proposes a new voice conversion algorithm that modifies the source speaker’s speech to sound as if produced by a target speaker. To date, most approaches to speaker transformation are based on mapping functions or codebooks. We propose a linear-filtering-based approach to the problem of mapping the spectral parameters of one speaker to those of the other. In the proposed method, the transformation is performed by filtering the source speaker’s Line Spectral Pair (LSP) frequencies to obtain estimates of the target speaker’s LSP frequencies. The speech signal is time-aligned into a sequence of HMM states, and a filter is designed for each HMM state using the aligned data. We consider two methods for spectral conversion. In the first, a linear transformation of the LSPs is obtained using an adaptive steepest gradient descent approach, and the mean values of the LSPs are adjusted to match those of the target speaker. In the second, weighted least squares estimation is used to prevent the LSP vectors from resulting in unstable vocal tract filters. This approach optimizes the differences between source and target LSPs, with weights equal to the inverses of the source LSP variances. The conversion is integrated into a Time Domain Pitch Synchronous Overlap and Add (TD-PSOLA) analysis-synthesis framework. The algorithm is objectively evaluated using a distance measure based on the log-likelihood ratio of observing the input speech, given Gaussian mixture speaker models for both the source and the target voice. Results under the Gaussian-mixture-model criterion demonstrate consistent transformation on a five-speaker database. The algorithm offers promise for rapidly adapting text-to-speech systems to new voices.
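The per-state filtering idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract gives no update equations, so the code below assumes a normalized LMS (steepest-descent) update of a square matrix W acting on mean-removed LSP frames, with per-dimension weights proportional to the inverse source variances. The function names, the step size, the weight normalization, and the use of one matrix per HMM state are all illustrative assumptions. The ordering check at the end relies on the standard property that an LSP vector yields a stable vocal tract filter when its frequencies are strictly increasing inside (0, π).

```python
import math


def column_means(frames):
    """Per-dimension mean of a list of equal-length LSP frames."""
    count = len(frames)
    return [sum(f[i] for f in frames) / count for i in range(len(frames[0]))]


def column_variances(frames, means):
    """Per-dimension (population) variance of the frames."""
    count = len(frames)
    return [sum((f[i] - means[i]) ** 2 for f in frames) / count
            for i in range(len(means))]


def train_state_filter(source, target, step=0.5, epochs=200, eps=1e-12):
    """Fit W so that mu_t + W (x - mu_s) approximates y for one HMM state.

    Assumption: normalized LMS with weights = inverse source variance,
    rescaled so the largest weight is 1 (keeps the step size stable).
    """
    dim = len(source[0])
    mu_s, mu_t = column_means(source), column_means(target)
    inv_var = [1.0 / max(v, eps) for v in column_variances(source, mu_s)]
    top = max(inv_var)
    weights = [w / top for w in inv_var]
    # Start from the identity: "no conversion" apart from the mean shift.
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for _ in range(epochs):
        for x, y in zip(source, target):
            xc = [x[j] - mu_s[j] for j in range(dim)]
            power = sum(v * v for v in xc) + eps
            for i in range(dim):
                pred = sum(W[i][j] * xc[j] for j in range(dim))
                err = (y[i] - mu_t[i]) - pred
                gain = step * weights[i] * err / power
                for j in range(dim):
                    W[i][j] += gain * xc[j]
    return W, mu_s, mu_t


def convert_frame(x, model):
    """Apply the learned filter and the target-mean adjustment to one frame."""
    W, mu_s, mu_t = model
    dim = len(x)
    return [mu_t[i] + sum(W[i][j] * (x[j] - mu_s[j]) for j in range(dim))
            for i in range(dim)]


def is_stable_lsp(lsp):
    """True if the frequencies are strictly increasing and lie inside (0, pi).

    Intended for vectors with at least two frequencies.
    """
    return all(0.0 < a < b < math.pi for a, b in zip(lsp, lsp[1:]))
```

In use, one model would be trained per HMM state from the time-aligned source and target frames, and each converted frame would be checked with `is_stable_lsp` before synthesis; how frames that fail the check are handled is left open here.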
Similar resources
Using Context-based Statistical Models to Promote the Quality of Voice Conversion Systems
This article examines methods for improving the performance of GMM-based voice conversion systems, taking the GMM method as the baseline to be improved. In current methods, because a single conversion function is used to convert all speech units and statistical averaging subsequently smooths the spectrum, we will observe quality ...
Study on Unit-Selection and Statistical Parametric Speech Synthesis Techniques
One of the interesting topics in the multimedia domain is enabling computers to produce speech. Speech synthesis gives the computer the human ability of speech production. The data-based approach and the process-based approach are the two main approaches to speech synthesis. Each approach has its own challenges. Unit-selection speech synthesis and statistical parametr...
Codec integrated voice conversion for embedded speech synthesis
Voice conversion technologies transform individual characteristics of speech patterns while preserving the original content, and can be widely used in speech processing. Considering limited system resources, in particular, of embedded concatenative speech synthesis, voice conversion may reduce the memory consumption of the acoustic database. Voice conversion enables the intra-gender or cross-ge...
Speech Enhancement by Modified Convex Combination of Fractional Adaptive Filtering
This paper presents new adaptive filtering techniques for use in speech enhancement systems. Adaptive filtering schemes are subject to different trade-offs regarding their steady-state misadjustment, speed of convergence, and tracking performance. Fractional Least-Mean-Square (FLMS) is a new adaptive algorithm which has better performance than the conventional LMS algorithm. Normalization of LMS ...
Speaker-Adaptive Speech Synthesis Based on Eigenvoice Conversion and Language-Dependent Prosodic Conversion in Speech-to-Speech Translation
This paper describes a novel approach based on voice conversion (VC) to speaker-adaptive speech synthesis for speech-to-speech translation. Voice quality of translated speech in an output language is usually different from that of an input speaker of the translation system since a text-to-speech system is developed with another speaker’s voices in the output language. To render the input speaker...
Publication date: 2003